The boom of social networks has given rise to a large volume of user-generated contents (UGCs), most of which are freely and publicly available. The potential of using this rich set of UGCs to study people's personal attributes and to build personalized applications has been widely validated. Despite their value, UGCs can also place users at high privacy risk, a problem that thus far remains largely unexplored. Privacy is defined as the individual's ability to control what information is disclosed, to whom, when and under what circumstances. As people and information both play significant roles, privacy has been elaborated as a boundary regulation process, in which individuals regulate their interaction with others by altering how open they are to them. In this paper, we aim to reduce users' privacy risks on social networks by answering the question of Who Can See What. Towards this goal, we propose a novel scheme to tackle the problem of boundary regulation, comprising descriptive, predictive and prescriptive components, as shown in Fig. 1. In particular, we first collect a set of posts and extract a rich set of privacy-oriented features to describe them. We then propose a novel taxonomy-guided multi-task learning model to identify which personal aspects are revealed by a given post. Finally, we construct standard guidelines from a study of 400 users to regularize users' actions and prevent privacy leakage. Extensive experiments on a real-world dataset have verified our scheme. We have released the data, code and parameters to facilitate the research community.


Fig. 1. Illustration of the proposed scheme for boundary regulation. In the first component, we build a comprehensive taxonomy of personal aspects, collect a benchmark dataset from Twitter and extract a rich set of features to describe the UGCs. The second component presents a taxonomy-constrained model to detect whether a given post leaks certain personal aspects. In the last component, following the guidelines built via AMT, we advise users on what they should do.

Data Collection

To build our benchmark dataset, we collected social posts for each category in the pre-defined taxonomy using the Twitter search service. In this way, we obtained 269,090 raw tweets.


Feature name | Description | Link
LIWC | LIWC, short for Linguistic Inquiry and Word Count, is a transparent psycholinguistic lexicon analysis tool. We adopted the LIWC features to capture the sensitivity of a given UGC. | Data
Sentence2Vector | In our work, we treated each tweet as a sentence and utilized the Sentence2Vec tool to generate a fixed-dimension (100) vector representation of each tweet. | Data
Metadata | We extracted several metadata features, such as the presence of hashtags, images, emojis and user mentions. Moreover, we also incorporated the timestamp as an important feature. | Data
Privacy Dictionary | The privacy dictionary is a new linguistic resource for automated content analysis of privacy-related texts. With the help of this dictionary, we can generate a 9-dimensional feature vector. | Data
Sentiment | We utilized the Stanford NLP sentiment classifier to judge each tweet's polarity. We assigned each tweet a value ranging from 0 to 4, corresponding to very negative, negative, neutral, positive and very positive. | Data
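
As an illustration of the metadata features described above, the following is a minimal sketch and not the released extraction code. It assumes each tweet is available as plain text plus a parsed timestamp. The function name `metadata_features`, the emoji character ranges, and the choice of hour and weekday as timestamp features are all assumptions made for this example. Image presence is omitted because it would have to come from the tweet's attached media rather than from its text.

```python
import re
from datetime import datetime

# Simple patterns for Twitter-style hashtags and user mentions.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji ranges (an assumption for this sketch, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def metadata_features(text, created_at):
    """Return binary presence flags and simple temporal features for one tweet."""
    return {
        "has_hashtag": int(bool(HASHTAG_RE.search(text))),
        "has_mention": int(bool(MENTION_RE.search(text))),
        "has_emoji": int(bool(EMOJI_RE.search(text))),
        # How the timestamp is encoded is not specified in the text;
        # hour of day and day of week are hypothetical choices.
        "hour": created_at.hour,
        "weekday": created_at.weekday(),
    }


# Made-up tweet used only to show the call.
example = metadata_features(
    "Stuck at home with the flu \U0001F637 @alice #sick",
    datetime(2017, 3, 6, 22, 15),
)
```

Binary flags keep these features on the same footing as the other hand-crafted descriptors, so they can be concatenated with the LIWC, privacy dictionary and sentiment values into a single feature vector.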

Ground Truth

In our work, we constructed the ground truth about what has been revealed by a given post via Amazon Mechanical Turk (AMT). Using a majority voting strategy to establish the final label of each post, we obtained 11,368 labeled posts. The ground truth can be accessed via this link.
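
The majority voting step can be sketched as follows. This is a hedged illustration only: the number of annotators per post, the handling of ties, and the names `majority_label` and `annotations` are assumptions, since the text does not specify them. Here a label is accepted only if more than half of the annotators chose it, and posts without such a label are discarded.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of the votes (an assumed tie-handling rule).
    return label if count > len(votes) / 2 else None


# Hypothetical annotations: post id -> labels given by three annotators.
annotations = {
    "post_1": ["health conditions", "health conditions", "negative emotion"],
    "post_2": ["current location", "full name", "home address"],
}

ground_truth = {pid: majority_label(votes) for pid, votes in annotations.items()}
# Keep only the posts on which the annotators reached a majority.
labeled = {pid: label for pid, label in ground_truth.items() if label is not None}
```

In this example `post_1` is kept with the label "health conditions", while `post_2` has no majority and is dropped.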


We conducted a user study via AMT to build guidelines regarding the disclosure norms of different circles. Considering the existence of cultural differences, we launched a cross-cultural survey in two distinct regions: the U.S. and Asia. The complete results are listed as follows.

Aspects | The U.S. | Asia
home address | 191 142 10 6 | 161 146 36 12
negative emotion | 177 187 119 73 | 98 155 62 40
passing away | 191 173 70 24 | 176 149 62 15
relationship status | 189 192 152 104 | 146 170 104 63
current location | 189 175 63 18 | 150 155 70 23
self promotion | 196 193 188 171 | 150 162 141 131
health conditions | 196 142 35 14 | 170 130 39 15
specific complaints | 107 156 56 35 | 73 136 56 38
places planning to go | 190 183 102 43 | 154 174 78 27
activities outside of home and work | 188 189 123 42 | 132 162 95 44
full name | 194 179 94 35 | 153 156 88 52
relationship status change | 179 192 102 38 | 152 167 70 23
have babies | 191 193 127 49 | 180 156 81 26
activities at home | 190 186 123 70 | 158 137 67 26
activities at work | 188 184 111 52 | 118 165 96 33
career promotion | 193 189 125 56 | 167 168 91 41
general complaints | 184 188 167 119 | 134 158 104 64
positive emotion | 190 195 166 108 | 151 173 89 42